Croatian genetic heritage: an updated Y-chromosome story

Aim To analyze an additional set of Y-chromosome genetic markers to acquire a more detailed insight into the diversity of the Croatian population. Methods A total of 518 Yfiler Plus profiles were genotyped. Allele frequencies, haplotype frequencies, and haplotype diversity were calculated by using the STRAF software v. 2.0.4. Genetic distances were quantified by Rst with AMOVA online tool from the YHRD. The evolutionary history was inferred with the neighbor-joining method of phylogenetic tree construction in the MEGAX software. Whit Athey's Haplogroup Predictor v. 5 was used for additional comparison with regional and other European populations. Results A total of 507 haplotypes were used for genetic STR analysis. An interpopulation study on 17 Y-STR markers showed the lowest genetic diversity between the Croatian and Bosnian-Herzegovinian populations and the highest between the Croatian and Irish populations. Additional interpopulation comparison with the original 27 Y-STR markers (for the population with available data) was also performed. A total of 518 haplotypes were used in the determination of haplogroup diversity. Haplogroup I with its sublineage I2a expressed the highest prevalence. The second most prevalent haplogroup was R, with its major sublineage R1a, except for the subpopulation of Hvar, where E1b1b was the second most prevalent haplogroup. Rare haplogroups also confirmed in this study were L, T, and Q. G1 was detected for the first time in the Croatian population. Conclusion We obtained a new insight into the differences between examined subpopulations of Croatia and their possible (dis)similarities with neighboring and distant populations.

The Y chromosome ( ∼ 60 Mb) is relatively small and inherited from father to son unchanged (apart from for occasional mutations). Except for the small pseudoautosomal regions (PAR), there is no recombination between the X and Y chromosome (1)(2)(3). This is why haplotype inheritance through the male lineage can be tracked and analyzed (2,(4)(5)(6).
The Y chromosome mostly consists of repetitive sequences (around 50%), which are single-base substitutions, Alu elements, and long interspersed nuclear elements (LINEs). Short tandem repeats (STRs) as repetitive elements are the base of population genetic studies. Their average mutational frequency is ∼ 0.2% per generation (7,8).
Y haplogroup can be defined as a part of the Y-chromosome family related by ancestry and determined by a specific set of Y chromosomal single nucleotide polymorphisms (Y-SNPs). It is important to better understand the demographic processes that shaped modern populations (8,9). The low mutation rate makes Y-SNP markers suitable for the conventional method of Y haplogroup defining.
Y-chromosome haplogroups can also be successfully predicted from Y-STR markers (Y-STR haplotype) by using Y-STR haplogroup predicting tools. Lately, this method has drawn attention due to its effectiveness in terms of labor, time, and costs (10). The haplotype helps us to analyze the influence of genes on disease-related alleles and represents the set of alleles on the same chromosome. On the other hand, major haplogroups (branches of Y-chromosome phylogeny), labeled A-T, reflect the establishment and expansion of major population groups and can indicate the time scale and the route of major migration events (11).
The main aim of this research is to update the information about Croatian Y-chromosome diversity by using additional Y-STR loci to compare new results with the previously published results generated using Y-STR and Y-SNP markers (12)(13)(14)(15)(16)(17)(18)(19)(20). A secondary aim was to analyze the genetic structure of five regional subpopulations (with the local centers in Osijek, Pula, Varaždin, Split, and Hvar Island) by identifying the most common haplogroups in these regions. The analysis also included the genetic differences between these five subpopulations and their differences with neighboring countries.

Statistical analysis
Allele and haplotype frequencies, the number of alleles and different haplotypes, as well as gene and haplotype diversity were estimated to assess the intrapopulation diversity.
Nei's formula HD = (1-∑ p i 2 )*n/(n-1) was used to calculate haplotype diversity; where n is the sample size and p i is the i th haplotype frequency. Gene diversity was calculated as 1-∑ p i 2 , where p i is the allele frequency. Match probability (MP) was calculated with the formula ∑ p i 2 , where p i is the frequency of the i th haplotype. Discrimination capacity (DC) was determined by dividing the number of haplotypes by the number of individuals in the population (21,22). STRAF software package v. 2.0.4 was used to calculate allele and haplotype frequencies. The same software was used to calculate gene and haplotype diversity (23,24). R st , calculated by AMOVA online tool from the YHRD, was used to quantify genetic distances between groups of men and between populations (25,26). Associated probability values (P values) with 10 000 permutations were in-cluded for the studied populations. The multidimensional scaling plots (MDS) showing the comparison of population haplotype data from YHRD were generated by using genetic distances.
AMOVA analysis was performed with two population groups. The number of the populations with available data for 27 STR loci was relatively small, especially in the clos-est Croatian neighborhood. Therefore, the first group was analyzed by comparing the 17 Y-STR loci included in the AmpFLSTR Yfiler PCR Amplification Kit. The second group was analyzed by comparing the whole set of 27 Y-STR loci included in the Yfiler Plus PCR Amplification Kit.
The first group of European populations selected for comparison with the population of Croatia by using 17 Y-STRs    Available population data and all related references are included in the YHRD (25,26). The evolutionary history was inferred for both sets of markers by using the neighbor-joining (NJ) method of phylogenetic tree construction (27) in MEGAX (28), whereby the optimal tree is shown.
Y-chromosomal haplogroup prediction with allele frequencies on 518 Yfiler Plus profiles was performed by using Whit Athey's Haplogroup Predictor v. 5, an algorithm based on the Bayesian allele-frequency approach (29,30).

ReSuLTS AND DISCuSSIoN
A total of 518 haplotypes were detected and used for haplogroup prediction. Eleven haplotypes were considered newly detected microvariants, which required additional analysis for confirmation. Therefore, the remaining 507 haplotypes (the ones without newly detected microvariants) were used for additional statistical analysis. On a sample of fully genotyped 507 Y-STR profiles, 502 different haplotypes were detected, with 497 unique haplotypes and 5 haplotypes appearing twice. In addition, 196 alleles at 27 Y-STR loci were detected ( Table 1). The loci with the largest number of detected alleles were the DYS385a/b double locus (DYS385a had 8 alleles and DYS385/b had 10 alleles) and DYS481 (14 detected alleles). The loci with the smallest number of detected alleles (only four alleles each) were DYS393, DYS437, and Y-GATA-H4.
The haplotype diversity for the studied population was 1.0000 ± 0.0014, with DC of 1.00 and MP of 0.01. Genetic diversity ranged from 0.886 for DYS481 to 0.251 for DYS392. The genetic diversity average across all loci was 0.656. With six detected alleles and the lowest genetic diversity, DYS392 was one of the least polymorphic loci in the studied population. Therefore, it was not surprising that, with a frequency of 0.862, allele 11 at DYS392 was the most common allele (Table 1).
In order to determine additional genetic differences, an interpopulation analysis was done between five regions: Hvar (n = 104), Osijek (n = 109), Pula (n = 94), Split (n = 102), and Varaždin (n = 98). The lowest genetic diversity observed for the population of Hvar was compared with the population of Split (Rst = 0.0009, P = 0.3240). The greatest genetic diversity observed for the population of Hvar was     www.cmj.hr P < 0.0001). A higher genetic distance was observed when the study population was compared with other European and worldwide populations.
To further investigate molecular evolutionary relationships between the geographical subpopulations of Croatia, NJ phylogenetic trees were constructed based on Rst values for different regions of Croatia (Figure 1).
Genetic relationships between the investigated populations are shown in MDS plots ( Figure 1, Figures 2, and Figure 3). The results of such comparisons confirm the general trends that were shown in Supplementary Table 1 and  Supplementary Table 2.
The NJ phylogenetic tree shows the genetic relationships and clustering between five Croatian regions based on the population study on 27 Y-STR markers (Figure 1). Hvar and Split subpopulations are clustered together.
Osijek and Pula subpopulations are in a separate cluster. The population of the Varaždin region is in a cluster on a different branch, which may indicate its genetic specificity (probably linked with geographical position) relative to the other four examined regional subpopulations.
In the MDS plot showing the group of populations compared by using a set of 17 Y-STR markers, as expected, the Croatian population is clustered with the geographically close populations of Bosnia and Herzegovina, Serbia, and Slovenia. The populations of Ukraine, Bulgaria, and Hungary are also clustered close to these four populations. The next closest European populations are North Macedonia, Albania, and the Czech Republic ( Figure 2).
In the MDS plot showing the group of populations compared by using a set of 27 Y-STR markers, the Croatian population is closely clustered with the geographically close populations of Serbia and Slovenia. The populations of Poland, Hungary, and the Russian Federation are also clustered relatively close to these three populations. The next closest European population is that of North Macedonia (Figure 3). The Croatian population was not compared with some neighboring populations (ie, Bosnia and Herzegovina and Montenegro) and a few others that showed clustering when analyzed on 17 Y-STR markers (eg, Bulgaria, Albania, Ukraine, etc) since data on 27 Y-STR markers are unavailable for these populations.
Our study showed a high degree of homogeneity of the Croatian population. Certain genetic simi-larity was observed at the regional level (between the population of the Pula region and Serbian population; Rst = 0.0063, P = 0.1013), and between the population of the Varaždin region and the neighboring Slovenian population; Rst = -0.0002, P = 0.4124). These results prove again that the Y-chromosome is expected to show greater geographical clustering than other population markers (2,16), but also could potentially mark immigrational impacts from the eastern neighboring countries, such as those in the Istrian region, most probably in the second half of the 20th century. However, these similarities still should be confirmed by additional analysis and increasing/structuring the sample size of the Pula and Varaždin region.
Four of the five subpopulations of Croatia showed expected results ( Figure 5A-C). High frequency of haplogroup I was reported with its known sublineage I2a in the subpopulations: Hvar 52.88%, Split 50.48%, Osijek 40.91%, and Pula 31.31%. Previously published reports demonstrate similar results (12,13,16,18,19). However, slightly different results were obtained in the subpopulation of Varaždin ( Figure  5D). In this subpopulation, R1a was the most frequent haplogroup with a frequency of 38%, while the frequency for I2a haplogroup was 18%. Interestingly, R1a was also the dominant haplogroup within the Slovenian population (31), which is the closest neighboring abroad population to Varaždin county. However, as we have already stated, these initially notified similarities still should be confirmed by additional analysis and by increasing/structuring the sample size of the Varaždin region ( Figure 5).
In summary, sublineage I2a was generally the most frequent haplogroup in the populations of Croatia in this study, but also in all previously studies (12,13,16,18,19,32). Similar results were obtained in an earlier study, when I-P37 (a former name for the most I2a sublineage) for the Croatian population in Bosnia and Herzegovina was detected at a ratio of 71.1% (15).
Haplogroup I (previously described as Eu7) (12) arrived to the Balkans around 25 000 years ago from the Middle East through Anatolia (9,16). One scenario suggests the possibility of population expansion from one of the post-Glacial refugia into the rest of the Balkan Peninsula (15). There is also a possibility that this haplogroup could be connected with more recent population movements from Eastern Europe, but this idea still has to be examined (9). Definitively, when compared with the other populations in Europe, the I2a haplogroup sublineage is considered a characteristic Southeast European haplogroup (33).
The R1a (previously described as Eu19) (12), as a leading sublineage of haplogroup R, was the second most frequent haplogroup in the studied population of Croatia, with an overall frequency of 24.32%. The following prevalences of haplogroup R1a in the subpopulations of Croatia were reported: Varaždin 38%, Pula 28.28%, Osijek 26.36%, and Split 19.05%. In the subpopulation of Hvar, a small genetic deviation in the frequency of haplogroups R1a and E1b1b was reported. The R1a haplogroup accounted for 10.58%, just slightly lower than haplogroup E1b1b with the frequency of 11.54% ( Figure 5A). This is most likely due to the founder effect, which is expected for island populations. In previously reported studies on the mainland population of Croatia, haplogroup R was reported as the second most frequent (13,18,32). Migration theories of R1a origins indicate the outflow of haplogroup R from West Asia to the Balkans as a post-last glacial maximum (LGM) event during the Mesolithic (16,34).
Sublineage R1b (previously described as Eu18) (12) showed a lower frequency in the studied population. The overall frequency of the R1b sublineage for the population of Croatia amounts to 6.37%. The highest frequency of R1b haplogroup was reported in the subpopulation of Varaždin, with a prevalence of 9%, and in Pula, with a frequency of 8.08%. The most similar results were obtained in the Bosnian population based on 481 Y-STR profiles, whereby R1b accounted for 8.75% of the samples (10).
There are two theories about E1b1b arrival in Europe. One theory is a post-LGM event from Asia and Africa during the Neolithic period, while the other theory suggests that this haplogroup is Balkan-specific, and originated around 8000 years ago during Greek colonization in the northern part of the Peninsula (16,35). This ancient European haplogroup shows its possible dual origin from two different source populations, during the recolonization of Europe from Iberia and from West Asia (16,32).
An approximate comparison between the frequencies of the earlier used Y chromosome lineage (Eu) determined by Semino et al (12) and the frequency of the currently used haplogroups detected in the Croatian population is shown in Table 3. The exact comparison is not possible because current nomenclature offers a more detailed and precise insight into Y chromosome diversity. However, this table could approximate a comparison between early and currently detected Y chromosome diversity within the Croatian population.
Rare haplogroups discovered in this study were Q, T, L and G1, each present in 1.93%, 0.58%, 0.19%, and 0.19% of all samples, respectively. Haplogroup L is associated with South Asia and India but is also found in low frequencies in Central Asia, Southwest Asia, and Southern Europe. With its alternative phylogenetic name K1a, haplogroup L is closely related to haplogroup T (36). Haplogroup T (phylogenetic name K1b) originates from Western Asia, spreading to East Africa, South Asia, and Southern Europe (37,38). Haplogroup Q is the only Pan-American haplogroup and confirms the Asian origin of Native Americans (39). It provides insight into the main Asian-American migrations. We detected haplogroup G1 for the first time within the Croatian population. This haplogroup is found predominantly in the Eurasian population, particularly in Iran, and is very rare in Europe. Some authors suggest that this rare haplogroup could have been related to the expansion of Iranian speakers northwards to the Eurasian steppe (40). However, its origin is still not clearly described.
This study provided a more detailed insight into the genetic diversity of the subpopulations of Croatia. Furthermore, our results generally confirmed previous results. Rst statistics was used to compare 27 Y-STR loci data from the studied Croatian population with data of other populations available from YHRD. The results indicate that the Croatian population does not deviate significantly from the neighboring populations of Bosnia and Herzegovina, Slovenia, and Serbia. This proves that the Y chromosome genetic marker has a noticeable geographical background (2,15), and this analysis resulted in expected geographic clustering.
Most of the Croatian men ("owners" of HgI, R1a, and R1b) harbor the ancestral genetic impact of Old European people who settled in Europe approximately 25 000-30 000 years ago and survived the LGM in several different refugia (16). Our results on new additional Y-STR loci confirmed that more than 78% of the contemporary Croatians are included in that group. The rest of the population relates to the people who arrived mostly during the Neolithization process. A small portion of the examined population originated from the "owners" of rare haplogroups in terms of European genetic diversity, the origin of which is still not clarified. The use of additional Y-STR loci provided detailed insight and supplementary information regarding genetic diversity of the Croatian population.
The limitation of the study is that some of the regional populations could not be compared with the Croatian population due to the lack of published data. Therefore, two levels of the comparison were necessary. Furthermore, current results for Pula and Varaždin subpopulations indicate somewhat greater similarity to regional populations. Although this could be explained based on historical facts, more detailed structuring of these populations should be performed to confirm the obtained results.